2018 Edition
Risk analysis is part of every decision we make when faced with uncertainty, ambiguity, and variability. Indeed, even though we have unprecedented access to information, we can't accurately predict the future. In finance, there is a fair amount of uncertainty and risk involved with estimating the future value of financial products, due to the wide variety of potential outcomes. Monte Carlo simulation (also known as the Monte Carlo Method) allows inspecting many possible outcomes of the decision making process, and can be used to assess the impact of risk: this, in turns, allows for better decision-making under uncertainty.
The main objectives we set for this Notebook are as follows:
This Notebook is inspired by Chapter 9 of the book Advanced Analytics with Spark by Josh Wills, Sandy Ryza, Sean Owen, and Uri Laserson. It is strongly suggested to read this Chapter to get a general idea of the topic of this Notebook.
Monte Carlo simulation is a computerized mathematical technique that can be applied such that it is possible to account for risk in quantitative analysis and decision making. This technique is used in many different fields, such as R&D, risk management, portfolio management, pricing derivatives, strategic planning, project planning, cost modeling and many more.
In general, MCS is a technique that "converts" uncertainty on input variables of a model into probability distributions. By combining the distributions and randomly selecting values from them, it recalculates the simulated model many times, to determine the probability of the output.
Historically, this technique was first used by scientists working on the atomic bomb: it was named after Monte Carlo, the Monaco resort town renowned for its casinos. Since its introduction in World War II, Monte Carlo simulation has been used to model a variety of physical and conceptual systems.
Monte Carlo simulation performs risk analysis by building models of possible results by substituting a range of possible input values, that constitute uncertainty, into a statistical distribution. It then computes possible outcomes repeatedly, each time using a different set of random values from the probability functions that "model" the input. Depending upon the number of random input variables and their distribution, a Monte Carlo simulation could involve thousands or tens of thousands of "rounds" before it is complete. When complete, Monte Carlo simulation produces distributions of possible outcome values.
By using probability distributions instead of actual input samples, it is possible to model more accurately uncertainty: different choices of distributions will yield different outputs.
Imagine you are the marketing manager for a firm that is planning to introduce a new product. You need to estimate the first-year net profit from this product, which might depend on:
Net profit will be calculated as $Net Profit = Sales Volume* (Selling Price - Unit cost) - Fixed costs$. Fixed costs (accounting for various overheads, advertising budget, etc.) are known to be \$ 120,000, which we assume to be deterministic. All other factors, instead, involve some uncertainty: sales volume (in units) can cover quite a large range, the selling price per unit will depend on competitor actions, which are hard to predict, and unit costs will also vary depending on vendor prices and production experience, for example.
Now, to build a risk analysis model, we must first identify the uncertain variables -- which are essentially random variables. While there's some uncertainty in almost all variables in a business model, we want to focus on variables where the range of values is significant.
from datetime import datetime, timedelta
from itertools import islice
from os import listdir
from os.path import isfile, join
from scipy import stats
from IPython.display import display
from sklearn.metrics import mean_squared_error
from statsmodels.nonparametric.kernel_density import KDEMultivariate
from statsmodels.nonparametric.kde import KDEUnivariate
from statsmodels.graphics.gofplots import qqplot_2samples
%matplotlib inline
import numpy as np
import statsmodels.api as sm
import random
import pandas as pd
import matplotlib.pyplot as plt
import scipy
import seaborn as sns
import math
from sklearn.mixture import GaussianMixture
plt.style.use(['default', 'bmh'])
base_folder = "monte-carlo-risk/"
stock_folder = base_folder + 'stocks'
factors_folder= base_folder + "factors/"
Based on a hypothetical market research you have done, you have beliefs that there are equal chances for the market to be slow, normal, or hot:
average_unit = (50000 + 75000 + 100000) / 3
average_price = (11.0 + 10.0 + 8.0) / 3
print("average unit:", average_unit)
print("average_price:", average_price)
Another uncertain variable is Unit Cost. In our illustrative example, we assume that your firm's production manager advises you that unit costs may be anywhere from \$5.50 to \$7.50, with a most likely expected cost of \$6.50. In this case, the most likely cost can be considered as the average cost.
Our next step is to identify uncertain functions -- also called functions of a random variable. Recall that Net Profit is calculated as $Net Profit = Sales Volume * (Selling Price - Unit cost) - Fixed costs$. However, Sales Volume, Selling Price and Unit Cost are all uncertain variables, so Net Profit is an uncertain function.
The simplest model to predict the Net Profit is using average of sales volume, average of selling price and average of unit cost for calculating. So, if only consider averages, we can say that the $Net Profit = 75,000*(9.66666666 - 6.5) - 120,000 \sim 117,500$.
However, as Dr. Sam Savage warns, "Plans based on average assumptions will be wrong on average." The calculated result is far from the actual value: indeed, the true average Net Profit is roughly \$93,000, as we will see later in the example.
def calNetProfit(average_unit, average_price, average_unitcost, fixed_cost):
return average_unit * (average_price - average_unitcost) - fixed_cost
average_unitcost = 6.5
fixed_cost = 120000
NetProfit = calNetProfit(average_unit, average_price, average_unitcost, fixed_cost)
print("Net profit:", NetProfit)
trueNetProfit = 93000
error = (NetProfit - trueNetProfit) / (trueNetProfit)
print("Error in percentage:", error * 100, " %")
As discussed before, the selling price and selling volume both depend on the state of the market scenario (slow/normal/hot). So, the net profit is the result of two random variables: market scenario (which in turn determines sales volumes and selling price) and unit cost.
Now, let's assume (this is an a-priori assumption we make) that market scenario follows a discrete, uniform distribution and that unit cost also follows a uniform distribution. Then, we can compute directly the values for selling price and selling volumes based on the outcome of the random variable market scenario, as shown in Section 2.1.
From these a-priori distributions, in each run (or trial) of our Monte Carlo simulation, we can generate the sample value for each random variable and use it to calculate the Net Profit. The more simulation runs, the more accurate our results will be. For example, if we run the simulation 100,000 times, the average net profit will amount to roughly \$92,600. Every time we run the simulation, a different prediction will be output: the average of such predictions will consistently be less than \$117,500, which we predicted using averages only.
Note also that in this simple example, we generate values for the market scenario and unit cost independently: we consider them to be independent random variables. This means that the eventual (and realistic!) correlation between the market scenario and unit cost variables is ignored. Later, we will learn how to be more precise and account for dependency between random variables.
# Get sales volume and price based on market scenario
# the function returns a tuple of (sales_volume, price)
def get_sales_volume_price(scenario):
# Slow market
if scenario == 0:
return (50000, 11)
# Normal market
if scenario == 1:
return (75000, 10)
# Hot market
if scenario == 2:
return (100000, 8)
Function uniform(a,b) in module random generates a number $a<=c<=b$, which is drawn from a uniform distribution.
Function randint(a,b) helps you generating an integer number $a<=c<=b$
trueNetProfit = 92600
num_simulation = 100000
netProfitAverages, errors = [], []
netProfit = 0
for i in range(0,num_simulation):
unit_cost = random.uniform(5.5, 7.5)
market_scenario = random.randint(0, 2)
sales_volume, price = get_sales_volume_price(market_scenario)
netProfit += calNetProfit(sales_volume, price, unit_cost, fixed_cost)
netProfitAverages.append(netProfit / (i+1))
errors.append(100 * abs(netProfitAverages[-1] - trueNetProfit) / (trueNetProfit))
print("Average net profit:", netProfitAverages[-1])
print("Error in percentage:", errors[-1], " %")
plt.figure(figsize=(20,10))
plt.subplot(1, 2, 1)
plt.plot(netProfitAverages)
plt.axhline(y=trueNetProfit, color='red', linestyle='dotted', linewidth=3)
plt.yticks(list(plt.yticks()[0]) + [trueNetProfit])
plt.xlabel("Number of simulations")
plt.ylabel("Average net profit")
plt.grid(linestyle='--', axis="y")
plt.subplot(1, 2, 2)
plt.plot(errors)
plt.xlabel("Number of simulations")
plt.ylabel("Error")
plt.grid(linestyle='--', axis="y")
plt.show()
In what follows, we summarize the most common probability distributions that are used as a-priori distributions for input random variables:
Normal/Gaussian Distribution: this is a continuous distribution applied in situations where the mean and the standard deviation of a given input variable are given, and the mean represents the most probable value of the variable. In other words, values "near" the mean are most likely to occur. This is symmetric distribution, and it is not bounded in its co-domain. It is very often used to describe natural phenomena, such as people’s heights, inflation rates, energy prices, and so on and so forth. An illustration of a normal distribution is given below:
![]()
Lognormal Distribution: this is a distribution which is appropriate for variables taking values in the range $[0, \infty]$. Values are positively skewed, not symmetric like a normal distribution. Examples of variables described by some lognormal distributions include, for example, real estate property values, stock prices, and oil reserves. An illustration of a lognormal distribution is given below:
![]()
Triangular Distribution: this is a continuous distribution with fixed minimum and maximum values. It is bounded by the minimum and maximum values and can be either symmetrical (the most probable value = mean = median) or asymmetrical. Values around the most likely value (e.g. the mean) are more likely to occur. Variables that could be described by a triangular distribution include, for example, past sales history per unit of time and inventory levels. An illustration of a triangular distribution is given below:
![]()
Uniform Distribution: this is a continuous distribution bounded by known minimum and maximum values. In contrast to the triangular distribution, the likelihood of occurrence of the values between the minimum and maximum is the same. In other words, all values have an equal chance of occurring, and the distribution is simply characterized by the minimum and maximum values. Examples of variables that can be described by a uniform distribution include manufacturing costs or future sales revenues for a new product. An illustration of the uniform distribution is given below:
![]()
Exponential Distribution: this is a continuous distribution used to model the time that pass between independent occurrences, provided that the rate of occurrences is known. An example of the exponential distribution is given below:
![]()
Discrete Distribution : for this kind of distribution, the "user" defines specific values that may occur and the likelihood of each of them. An example might be the results of a lawsuit: 20% chance of positive verdict, 30% change of negative verdict, 40% chance of settlement, and 10% chance of mistrial.
We hope that by now you have a good understanding about Monte Carlo simulation. Next, we apply this method to a real use case: financial risk estimation.
Imagine that you are an investor on the stock market. You plan to buy some stocks and you want to estimate the maximum loss you could incur after two weeks of investing. This is the quantity that the financial statistic "Value at Risk" (VaR) seeks to measure. VaR is defined as a measure of investment risk that can be used as a reasonable estimate of the maximum probable loss for a value of an investment portfolio, over a particular time period. A VaR statistic depends on three parameters: a portfolio, a time period, and a confidence level. A VaR of 1 million dollars with a 95% confidence level over two weeks, indicates the belief that the portfolio stands only a 5% chance of losing more than 1 million dollars over two weeks. VaR has seen widespread use across financial services organizations. This statistic plays a vital role in determining how much cash investors must hold to meet the credit ratings that they seek. In addition, it is also used to understand the risk characteristics of large portfolios: it is a good idea to compute the VaR before executing trades, such that it can help take informed decisions about investments.
Our goal is calculating VaR of two weeks interval with 95% confidence level and the associated VaR confidence interval.
In this use case, we will use some terms that might require a proper definition, given the domain. This is what we call the Domain Knowledge.
We have a list of instruments that we plan to invest in. The historical data of each instrument has been collected for you. For simplicity, assume that the returns of instruments at a given time, depend on 4 market factors only:
Our goal is building a model to predict the loss after two weeks' time interval with confidence level set to 95%.
As a side note, it is important to realize that the approach presented in this Notebook is a simplified version of what would happen in a real Financial firm. For example, the returns of instruments at a given time often depend on more than 4 market factors only! Moreover, the choice of what constitute an appropriate market factor is an art!
The stock data can be downloaded (or scraped) from Yahoo! by making a series of REST calls. The data includes multiple files. Each file contains the historical information of each instrument that we want to invest in. The data is in the following format (with some samples):
Date, Open, High, Low, Close, Volume, Adj Close
2016-01-22,66.239998,68.07,65.449997,67.860001,137400,67.860001
2016-01-21,65.410004,66.18,64.459999,65.050003,148000,65.050003
2016-01-20,64.279999,66.32,62.77,65.389999,141300,65.389999
2016-01-19,67.720001,67.989998,64.720001,65.379997,178400,65.379997
The data of GSPC and IXIC values (our two first market factors) are also available on Yahoo! and use the very same format.
The crude oil and treasure bonds data is collected from investing.com, and has a different format, as shown below (with some samples):
Date Price Open High Low Vol. Change %
Jan 25, 2016 32.17 32.36 32.44 32.10 - -0.59%
Jan 24, 2016 32.37 32.10 32.62 31.99 - 0.54%
Jan 22, 2016 32.19 29.84 32.35 29.53 - 9.01%
Jan 21, 2016 29.53 28.35 30.25 27.87 694.04K 11.22%
Jan 20, 2016 26.55 28.33 28.58 26.19 32.11K -6.71%
Jan 19, 2016 28.46 29.20 30.21 28.21 188.03K -5.21%
In our use case, the factors' data will be used jointly to build a statistical model: as a consequence, we first need to preprocess the data to proceed.
In this Notebook, all data files have been downloaded for you, such that you can focus on pre-processing. Next, we will:
We need two functions to read and parse data from Yahoo! and Investing.com respectively. We are interested only in information about the time and the corresponding returns of a factor or an instrument: as a consequence, we will project away many columns of our RAW data, and keep only the information we are interested in.
The 3000-instrument and the 4-factor history are small enough to be read and processed locally: we do not need to use the power of parallel computing to proceed. Note that this is true also for larger cases with hundreds of thousands of instruments and thousands of factors. The need for a distributed system like Spark comes in when actually running the Monte Carlo simulations, which can require massive amounts of computation on each instrument.
datetime object by using the function strptime(<string>, <dtime_format>). In this case, the datetime format is "%b %d, %Y". For more information, please follow this link.
In the next cell, we simply copy data from our HDFS cluster (that contains everything we need for this Notebook) to the instance (a Docker container) running your Notebook. This means that you will have "local" data that you can process without using Spark. Note the folder location: find and verify that you have correctly downloaded the files!
! [ -d monte-carlo-risk ] || (echo "Downloading prepared data from HDFS. Please wait..." ; hdfs dfs -copyToLocal /datasets/monte-carlo-risk . ; echo "Done!";)
# read data from local disk
def readInvestingDotComHistory(fname):
def process_line(line):
cols = line.split('\t')
date = datetime.strptime(cols[0], "%b %d, %Y")
value = float(cols[1])
return (date, value)
with open(fname) as f:
content_w_header = f.readlines()
# remove the first line
# and reverse lines to sort the data by date, in ascending order
content = content_w_header[:0:-1]
return list(map(process_line , content))
factor1_files = ['crudeoil.tsv', 'us30yeartreasurybonds.tsv']
factor1_files = map(lambda fn: factors_folder + fn, factor1_files)
factors1 = [readInvestingDotComHistory(f) for f in factor1_files]
print("The first five elements of the first factor (crude oil) are:")
display(pd.DataFrame(factors1[0][:5], columns=['Date', 'Value']))
Now, the data structure factors1 is a list, containing data that pertains to two (out of a total of four) factors that influence the market, as obtained by investing.com. Each element in the list is a tuple, containing some sort of timestamp, and the value of one of the two factors discussed above. From now on, we call these elements "records" or "entries". Visually, factors1 looks like this:
| 0 (crude oil) | 1 (US bonds) |
|---|---|
| time_stamp, value | time_stamp, value |
| ... | ... |
| time_stamp, value | time_stamp, value |
| ... | ... |
# read data from local disk
def readYahooHistory(fname):
def process_line(line):
cols = line.split(',')#
date = datetime.strptime(cols[0], "%Y-%m-%d")#
value = float(cols[1])#
return (date, value)
with open(fname) as f:
content_w_header = f.readlines()
# remove the first line
# and reverse lines to sort the data by date, in ascending order
content = content_w_header[:0:-1]
return list(map(process_line , content))
factor2_files = ['GSPC.csv', 'IXIC.csv']
factor2_files = map(lambda fn: factors_folder + fn, factor2_files)
factors2 = [readYahooHistory(f) for f in factor2_files]
print("GSPC:")
display(pd.DataFrame(factors2[0][:5], columns=['Date', 'Value']))
print("IXIC:")
display(pd.DataFrame(factors2[1][:5], columns=['Date', 'Value']))
print("January 7th, 1950 is", datetime(1950, 1, 7).strftime("%A"))
print("January 8th, 1950 is", datetime(1950, 1, 8).strftime("%A"))
print("Time range of GSPC: [", factors2[0][0][0].strftime("%Y"), ",", factors2[0][-1][0].strftime("%Y"), "]")
print("Time range of IXIC: [", factors2[1][0][0].strftime("%Y"), ",", factors2[1][-1][0].strftime("%Y"), "]")
Now, the data structure factors2 is again list, containing data that pertains to the next two (out of a total of four) factors that influence the market, as obtained by Yahoo!. Each element in the list is a tuple, containing some sort of timestamp, and the value of one of the two factors discussed above. Visually, factors2 looks like this:
| 0 (GSPC) | 1 (IXIC) |
|---|---|
| time_stamp, value | time_stamp, value |
| ... | ... |
| time_stamp, value | time_stamp, value |
| ... | ... |
Next, we prepare the data for the instruments we consider in this Notebook (i.e., the stocks we want to invest in).
##### the data is similar to the one in Yahoo...
def process_stock_file(fname):
try:
return readYahooHistory(fname)
except Exception as e:
raise e
return None
# select path of all stock data files in "stock_folder"
files = [join(stock_folder, f) for f in listdir(stock_folder) if isfile(join(stock_folder, f))]
# assume that we invest only the first 35 stocks (for faster computation)
files = files[:35]
# read each line in each file, convert it into the format: (date, value)
rawStocks = [process_stock_file(f) for f in files]
print("Initial number of stocks:", len(rawStocks))
# select only instruments which have more than 5 years of history
# Note: the number of business days in a year is 260
number_of_years = 5
rawStocks = list(filter(lambda instrument: len(instrument) >= 260 * number_of_years , rawStocks))
print("Number of stocks after filtering:", len(rawStocks))
# For testing, print the first 5 entry of the first stock
print("The first 5 entries of the first stock:")
display(pd.DataFrame(rawStocks[0][:5], columns=['Date', 'Value']))
Different types of instruments may trade on different days, or the data may have missing values for other reasons, so it is important to make sure that our different histories align. First, we need to trim all of our time series to the same region in time. Then, we need to fill in missing values. To deal with time series that have missing values at the start and end dates in the time region, we simply fill in those dates with nearby values in the time region.
# note that the data of crude oil and treasury is only available starting from 26/01/2006
start = datetime(year=2009, month=1, day=23)
end = datetime(year=2014, month=1, day=23)
def isInTimeRegion(entry):
(date, value) = entry
return date >= start and date <= end
def trimToRegion(history, start, end):
# only select entries which are in the time region
trimmed = list(filter(isInTimeRegion, history))
# if the data has incorrect time boundaries, add time boundaries
if trimmed[0][0] != start:
trimmed.insert(0, (start, trimmed[0][1]))
if trimmed[-1][0] != end:
trimmed.append((end, trimmed[-1][1]))
return trimmed
# test our function
trimmedStock0 = trimToRegion(rawStocks[0], start, end)
# the first 5 records of stock 0
print('The first 5 records of stock 0:')
display(pd.DataFrame(trimmedStock0[:5], columns=['Date', 'Value']))
# the last 5 records of stock 0
print('The last 5 records of stock 0:')
display(pd.DataFrame(trimmedStock0[-5:], columns=['Date', 'Value']))
assert(trimmedStock0[0][0] == start), "the first record must contain the price in the first day of time interval"
assert(trimmedStock0[-1][0] == end), "the last record must contain the price in the last day of time interval"
We expect that we have the price of instruments and factors in each business day. Unfortunately, there are many missing values in our data: this means that we miss data for some days, e.g. we have data for the Monday of a certain week, but not for the subsequent Tuesday. So, we need a function that helps filling these missing values.
Next, we provide to you the function to fill missing value: read it carefully!
def fillInHistory(history, start, end):
curr = history
filled = []
idx = 0
curDate = start
numEntries = len(history)
while curDate < end:
# if the next entry is in the same day
# or the next entry is at the weekend
# but the curDate has already skipped it and moved to the next monday
# (only in that case, curr[idx + 1][0] < curDate )
# then move to the next entry
while idx + 1 < numEntries and curr[idx + 1][0] <= curDate:
idx +=1
# only add the last value of instrument in a single day
# check curDate is weekday or not
# 0: Monday -> 5: Saturday, 6: Sunday
if curDate.weekday() < 5:
filled.append((curDate, curr[idx][1]))
# move to the next business day
curDate += timedelta(days=1)
# skip the weekends
elif curDate.weekday() >= 5:
# if curDate is Sat, skip 2 days, otherwise, skip 1 day
curDate += timedelta(days=(7-curDate.weekday()))
return filled
# trim into a specific time region
# and fill up the missing values
stocks = list(map(lambda stock:
fillInHistory(
trimToRegion(stock, start, end),
start, end),
rawStocks))
# merge two factors, trim each factor into a time region
# and fill up the missing values
allfactors = factors1 + factors2
factors = list(map(lambda stock:
fillInHistory(
trimToRegion(stock, start, end),
start, end),
allfactors
))
# test our code
print("The first 5 records of stock 0:")
display(pd.DataFrame(stocks[0][:5], columns=['Date', 'Value']))
print("The last 5 records of stock 0:")
display(pd.DataFrame(stocks[0][-5:], columns=['Date', 'Value']))
print("The first 5 records of factor 0:")
display(pd.DataFrame(factors[0][:5], columns=['Date', 'Value']))
print("The last 5 records of factor 0:")
display(pd.DataFrame(factors[0][-5:], columns=['Date', 'Value']))
Recall that Value at Risk (VaR) deals with losses over a particular time horizon. We are not concerned with the absolute prices of instruments, but how those prices change over a given period of time. In our project, we will set that length to two weeks: we use the sliding window method to transform time series of prices into an overlapping sequence of price change over two-week intervals.
The figure below illustrates this process. The returns of market factors after each two-week interval is calculated in the very same way.
def buildWindow(seq, k=2):
"Returns a sliding window (of width k) over data from iterable data structures"
" s -> (s0,s1,...s[k-1]), (s1,s2,...,sk), ... "
it = iter(seq)
result = tuple(islice(it, k))
if len(result) == k:
yield result
for elem in it:
result = result[1:] + (elem,)
yield result
def calculateReturn(window):
# return the change of value after two weeks
return window[-1][1] - window[0][1]
def twoWeekReturns(history):
# we use 11 instead of 14 to define the window
# because financial data does not include weekends
# so there are 10 days
# but we need to include the 11th day to calculate the change in values after each two-weeks as indicated above
return [calculateReturn(entry) for entry in buildWindow(history, 11)]
stocksReturns = list(map(twoWeekReturns, stocks))
factorsReturns = list(map(twoWeekReturns, factors))
# test our functions
print("the first 5 returns of stock 0:")
display(pd.DataFrame(stocksReturns[0][:5], columns=['Return']))
print("the last 5 returns of stock 0:")
display(pd.DataFrame(stocksReturns[0][-5:], columns=['Return']))
windowOfStock0 = list(buildWindow(stocks[0], 11))[0]
display(pd.DataFrame({'Date': [x[0].strftime("%A, %B %d, %Y") for x in windowOfStock0] ,'Value': [x[1] for x in windowOfStock0]}, columns=['Date','Value']))
Alright! Now we have data that is properly aligned to start the training process: stocks' returns and factors' returns, per time windows of two weeks. Next, we will apply the MCS method.
Next, we overview the steps that you have to follow to build a model of your data, and then use Monte Carlo simulations to produce output distributions:
In our simulation, we will use a simple linear model. By our definition of return, a factor return is a change in the value of a market factor over a particular time period, e.g. if the value of the S&P 500 moves from 2000 to 2100 over a time interval, its return would be 100.
A vector that contains the return of 4 market factors is called a market factor vector. Generally, instead of using this vector as features, we derive a set of features from simple transformation of it. In particular, a vector of 4 values is transformed into a vector of length $m$ by function $F$. In the simplest case $F(v) = v$.
Denote $v_t$ the market factor vector, and $f_t$ the transformed features of $v_t$ at time $t$.
$f_{tj}$ is the value of feature $j$ in $f_t$.
Denote $r_{it}$ the return of instrument $i$ at time $t$ and $c_i$ the intercept term of instrument $i$.
We will use a simple linear function to calculate $r_{it}$ from $f_t$:
$$ r_{it} = c_i + \sum_{j=1}^{m}{w_{ij}*f_{tj}} $$
where $w_{ij}$ is the weight of feature $j$ for instrument $i$.
All that above means that given a market factor vector, we have to apply featurization and then use the result as a surrogate for calculating the return of the instruments, using the above linear function.
There are two questions that we should consider: how we apply featurization to a factor vector? and how to pick values for $w_{ij}$?
How we apply featurization to a factor vector? In fact, the instruments' returns may be non-linear functions of the factor returns. So, we should not use factor returns as features in the above linear function. Instead, we transform them into a set of features with different size. In this Notebook, we can include some additional features in our model that we derive from non-linear transformations of the factor returns. We will try adding two more features for each factor return: its square and its square root values. So, we can still assume that our model is a linear model in the sense that the response variable is a linear function of the new features. Note that the particular feature transformation described here is meant to be an illustrative example of some of the options that are available: it shouldn't be considered as the state of the art in predictive financial modeling!!.
How to pick values for $w_{ij}$?
For all the market factor vectors in our historical data, we transform them to feature vectors. Now, we have feature vectors in many two-week intervals and the corresponding instrument's returns in these intervals. We can use Ordinary Least Square (OLS) regression model to estimate the weights for each instrument such that our linear function can fit to the data. The parameters for OLS function are:
x: The collection of columns where each column is the value of a feature in many two-week intervaly: The return of an instrument in the corresponding time interval of x.The figure below shows the basic idea of the process to build a statistical model for predicting the returns of stock X.
def transpose(matrix):
return np.matrix.tolist(np.array(matrix).T)
# test function
assert (transpose([[1,2,3], [4,5,6], [7,8,9]]) == [[1, 4, 7], [2, 5, 8], [3, 6, 9]]), "Function transpose runs incorrectly"
def featurize(factorReturns):
squaredReturns = np.sign(factorReturns) * np.power(factorReturns, 2)
squareRootedReturns = np.sign(factorReturns) * np.sqrt(np.abs(factorReturns))
# concat new features
return np.matrix.tolist(np.concatenate((squaredReturns, squareRootedReturns, factorReturns)).flatten())
# test our function
assert (featurize([4, -9, 25]) == [16, -81, 625, 2, -3, 5, 4, -9, 25]), "Function runs incorrectly"
def estimateParams(y, x):
return sm.OLS(y, x).fit().params
# transpose factorsReturns
factorMat = transpose(factorsReturns)
# featurize each row of factorMat
factorFeatures = list(map(featurize, factorMat))
# OLS require parameter is a numpy array
factor_columns = np.array(factorFeatures)
#add a constant - the intercept term for each instrument i.
factor_columns = sm.add_constant(factor_columns, prepend=True)
# estimate weights
weights = [estimateParams(stockReturns, factor_columns) for stockReturns in stocksReturns]
print("Weights:")
display(pd.DataFrame(weights, columns=['Factor {0}'.format(i+1) for i in range(np.array(weights).shape[1])]))
print("The shape of weights:", np.array(weights).shape)
Since we cannot define the distributions for the market factors directly, we can only approximate their distribution. The best way to do that, is plotting their value. However, these values may fluctuate quite a lot.
Next, we show how to use the Kernel density estimation (KDE) technique to approximate such distributions. In brief, kernel density estimation is a way of smoothing out a histogram: this is achieved by assigning (or centering) a probability distribution (usually a normal distribution) to each data point, and then summing. So, a set of two-week-return samples would result in a large number of "super-imposed" normal distributions, each with a different mean.
To estimate the probability density at a given point, KDE evaluates the PDFs of all the normal distributions at that point and takes their average. The smoothness of a kernel density plot depends on its bandwidth, and the standard deviation of each of the normal distributions. For a brief introduction on KDE, please refer to this link.
def get_domain_and_density(samples):
vmin = min(samples)
vmax = max(samples)
stddev = np.std(samples)
domain = np.arange(vmin, vmax, (vmax-vmin)/100)
# a simple heuristic to select bandwidth
bandwidth = 1.06 * stddev * pow(len(samples), -.2)
# estimate density
kde = KDEUnivariate(samples)
kde.fit(bw=bandwidth)
density = kde.evaluate(domain)
return domain, density
def plotDistribution(samples, plotTitle=""):
domain, density = get_domain_and_density(samples)
# plot
plt.figure(figsize=(15,5))
plt.plot(domain, density)
plt.xlabel("Two Week Return ($)")
plt.ylabel("Density")
plt.grid(linestyle='--', axis="y")
if (plotTitle):
plt.title(plotTitle)
plt.show()
plotDistribution(factorsReturns[0], "Two-week Crude oil returns distribution")
plotDistribution(factorsReturns[1], "Two-week Treasury bonds returns distribution")
plotDistribution(factorsReturns[2], "Two-week GSPC returns distribution")
plotDistribution(factorsReturns[3], "Two-week IXIC returns distribution")
For the sake of simplicity, we can say that our smoothed versions of the returns of each factor can be represented quite well by a normal distribution. Of course, more exotic distributions, perhaps with fatter tails, could fit more closely the data, but it is outside the scope of this Notebook to proceed in this way.
Now, the simplest way to sample factors returns is to use a normal distribution for each of the factors, and sample from these distributions independently. However, this approach ignores the fact that market factors are often correlated. For example, when the price of crude oil is down, the price of treasury bonds is down too. We can check our data to verify about the correlation.
correlation = np.corrcoef(factorsReturns)
correlation
mask = np.zeros_like(correlation)
mask[np.triu_indices_from(correlation)] = True
with sns.axes_style("white"):
ax = sns.heatmap(correlation,
vmin=0,
vmax=1,
cmap="Blues",
xticklabels=['Crude oil', 'Treasury bonds', 'GSPC', ' '],
yticklabels=[' ', 'Treasury bonds', 'GSPC', 'IXIC'],
mask=mask,
cbar_kws={"orientation": "horizontal"})
plt.figure(figsize=(20,10))
plt.plot(factorsReturns[2], 'r' , label='GSPC factor')
plt.plot(factorsReturns[3], 'b' , label='IXIC factor')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xlabel('Time window')
plt.ylabel('Return')
plt.show()
plt.figure(figsize=(20,10))
plt.plot(np.array(factorsReturns[2]) * max(factorsReturns[3]) / max(factorsReturns[2]), 'r' , label='GSPC factor')
plt.plot(factorsReturns[3], 'b' , label='IXIC factor')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xlabel('Time window')
plt.ylabel('Return')
plt.show()
plt.figure(figsize=(20,10))
plt.plot(factorsReturns[0], 'c' , label='Crude oil')
plt.plot(factorsReturns[1], 'g' , label='Treasury bonds')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xlabel('Time window')
plt.ylabel('Return')
plt.show()
plt.figure(figsize=(20,10))
plt.plot(factorsReturns[0], 'c', label='Crude oil')
plt.plot(np.array(factorsReturns[1]) * max(factorsReturns[0]) / max(factorsReturns[1]), 'g', label='Treasury bonds')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xlabel('Time window')
plt.ylabel('Return')
plt.show()
The multivariate normal distribution can help here by taking the correlation information between the factors into account. Each sample from a multivariate normal distribution can be thought of as a vector. Given values for all of the dimensions but one, the distribution of values along that dimension is normal. But, in their joint distribution, the variables are not independent.
For this use case, we can write:
$$ \left(\begin{array}{c}f_{1}\\f_{2}\\f_{3}\\f_{4} \end{array}\right) \sim N \left[ \left( \begin{array}{c} \mu_1\\ \mu_2 \\ \mu_3 \\ \mu_4 \end{array} \right), \left( \begin{array}{cccc} \sigma^2_1 & \rho_{12} \sigma_1\sigma_2 & \rho_{13} \sigma_1\sigma_3 & \rho_{14} \sigma_1\sigma_4 \\ \rho_{12}\sigma_2\sigma_1 & \sigma^2_2 & \rho_{23} \sigma_2\sigma_3 & \rho_{24} \sigma_2\sigma_4\\ \rho_{13} \sigma_3\sigma_1 & \rho_{23} \sigma_3\sigma_2 & \sigma^2_3 & \rho_{34} \sigma_3\sigma_4 \\ \rho_{14} \sigma_4\sigma_1 & \rho_{24} \sigma_4\sigma_2 & \rho_{34} \sigma_3\sigma_4 & \sigma_4^2 \\ \end{array} \right) \right] $$
Or,
$$ f_t \sim N(\mu, \sum) $$
Where $f_1$, $f_2$, $f_3$ and $f_4$ are the market factors, $\sigma_i$ is the standard deviation of factor $i$, $\mu$ is a vector of the empirical means of the returns of the factors and $\sum$ is the empirical covariance matrix of the returns of the factors.
The multivariate normal is parameterized with a mean along each dimension and a matrix describing the covariance between each pair of dimensions. When the covariance matrix is diagonal, the multivariate normal reduces to sampling along each dimension independently, but placing non-zero values in the off-diagonals helps capture the relationships between variables. Whenever having the mean of this multivariate normal distribution and its covariance matrix, we can generate the sample values for market factors.
Next, we will calculate the mean and the covariance matrix of this multivariate normal distribution from the historical data.
np.cov can help calculating covariance matrix. Function np.random.multivariate_normal(<mean>, <cov>) is often used for generating samples.
factorCov = np.cov(factorsReturns)
factorMeans = [sum(u) / len(u) for u in factorsReturns]
sample = np.random.multivariate_normal(factorMeans, factorCov)
print("Covariance matrix: ", factorCov)
print("Mean: ", factorMeans)
print("Sample: ", sample)
def compare(samples, factorReturns, plotTitle):
domainSamples, densitySamples = get_domain_and_density(samples) # get the domain and the density of the samples
domainFactorReturns, densityFactorReturns = get_domain_and_density(factorReturns) # get the domain and the density of the factor returns
print("----", plotTitle, "----")
MSE = mean_squared_error(densityFactorReturns, densitySamples) # Mean squared error
print("Mean Squared Error:", MSE)
entropy = stats.entropy(densityFactorReturns, densitySamples) # Relative entropy
print("Relative Entropy:", entropy)
# plot
fig = plt.figure(figsize=(15,10))
fig.add_subplot(211)
plt.plot(domainFactorReturns, densityFactorReturns, label="Real factor")
plt.plot(domainSamples, densitySamples, label="Appoximation")
plt.xlabel("Two Week Return ($)")
plt.ylabel("Density")
plt.grid(linestyle='--', axis="y")
plt.title(plotTitle)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
#Q-Q plot
ax = fig.add_subplot(212)
qqplot_2samples(np.array(factorReturns), np.array(samples), line='45', ax=ax)
ax.set_xlabel('Real Factor')
ax.set_ylabel('Approximation')
ax.set_title('A Q–Q plot of the real factor versus its approximation.')
plt.show()
samples = np.random.multivariate_normal(factorMeans, factorCov, len(factorsReturns[0])).T
compare(samples[0], factorsReturns[0], "Two-week Crude oil returns distribution")
compare(samples[1], factorsReturns[1], "Two-week Treasury bonds returns distribution")
compare(samples[2], factorsReturns[2], "Two-week GSPC returns distribution")
compare(samples[3], factorsReturns[3], "Two-week IXIC returns distribution")
We define some functions that helps us calculating VaR 5%. You will see that the functions below are pretty complicated! This is why we provide a solution for you: however, study them well!!
The basic idea of calculating VaR 5% is that we need to find a value such that only 5% of the losses are bigger than it. That means the 5th percentile of the losses should be VaR 5%.
VaR can sometimes be problematic though, since it does give any information on the extent of the losses which can exceed the VaR estimate. CVar is an extension of VaR that is introduced to deal with this problem. Indeed, CVaR measures the expected value of the loss in those cases where VaR estimate has been exceeded.
def fivePercentVaR(trials):
numTrials = trials.count()
topLosses = trials.takeOrdered(max(round(numTrials/20.0), 1))
return topLosses[-1]
# an extension of VaR
def fivePercentCVaR(trials):
numTrials = trials.count()
topLosses = trials.takeOrdered(max(round(numTrials/20.0), 1))
return sum(topLosses)/len(topLosses)
def bootstrappedConfidenceInterval(
trials, computeStatisticFunction,
numResamples, pValue):
stats = []
for i in range(0, numResamples):
resample = trials.sample(True, 1.0)
stats.append(computeStatisticFunction(resample))
stats.sort()
lowerIndex = int(numResamples * pValue / 2 - 1)
upperIndex = int(np.ceil(numResamples * (1 - pValue / 2)))
return (stats[lowerIndex], stats[upperIndex])
Next, we will run the Monte Carlo simulation 10,000 times, in parallel using Spark. Since your cluster has 12 cores (two Spark worker nodes, each with 6 cores), we can set parallelism = 12 to dispatch simulation on these cores, across the two machines (remember, those are not really "physical machines", they are Docker containers running in our infrastructure).
# RUN SILMULATION
# We added a parameter 'distribFunc' to easily change our disribution later.
# The parameter 'df' meaning 'degrees of freedom' is useful for some distributions
def simulateTrialReturns(numTrials, factorMeans, factorCov, weights, distribFunc = np.random.multivariate_normal, df=None):
trialReturns = []
for i in range(0, numTrials):
# generate sample of factors' returns
trialFactorReturns = (
distribFunc(factorMeans, factorCov) if (df is None) else distribFunc(factorMeans, factorCov, df=df)
)
# featurize the factors' returns
trialFeatures = featurize(trialFactorReturns)
# insert weight for intercept term
trialFeatures.insert(0,1)
trialTotalReturn = 0
# calculate the return of each instrument
# then calulate the total of return for this trial features
trialTotalReturn = np.sum(np.dot(weights, trialFeatures))
trialReturns.append(trialTotalReturn)
return trialReturns
parallelism = 12
numTrials = 10000
trial_indexes = list(range(0, parallelism))
seedRDD = sc.parallelize(trial_indexes, parallelism)
bFactorWeights = sc.broadcast(weights)
trials = seedRDD.flatMap(lambda idx: \
simulateTrialReturns(
max(int(numTrials/parallelism), 1),
factorMeans, factorCov,
bFactorWeights.value
))
trials.cache()
valueAtRisk = fivePercentVaR(trials)
conditionalValueAtRisk = fivePercentCVaR(trials)
print("Value at Risk(VaR) 5%:", valueAtRisk)
print("Conditional Value at Risk(CVaR) 5%:", conditionalValueAtRisk)
plotDistribution(trials.collect(), "Two-week returns distribution")
The value of VaR depends on how many invested stocks and the chosen distribution of random variables. Assume that we get VaR 5% = -2.66, that means that there is a 0.05 probability that the portfolio will fall in value by more than \$2.66 over a two weeks' period if there is no trading. In other words, the loses are less than \$2.66 over two weeks' period with 95% confidence level. When a loss over two weeks is more than \$2.66, we call it **failure** (or **exception**). Informally, because of 5% probability, we expect that there are only $0.05*W$ failures out of total $W$ windows.
In general, the error in a Monte Carlo simulation should be proportional to 1/sqrt(n), where n is the number of trials. This means, for example, that quadrupling the number of trials should approximately cut the error in half. A good way to check the quality of a result is backtesting on historical data. Backtesting is a statistical procedure where actual losses are compared to the estimated VaR. For instance, if the confidence level used to calculate VaR is 95% (or VaR 5%), we expect only 5 failures over 100 two-week time windows.
The most common test of a VaR model is counting the number of VaR failures, i.e., in how many windows, the losses exceed VaR estimate. If the number of exceptions is less than selected confidence level would indicate, the VaR model overestimates the risk. On the contrary, if there are too many exceptions, the risk is underestimated. However, it's very hard to observe the amount of failures suggested by the confidence level exactly. Therefore, people try to study whether the number of failures is reasonable or not, or will the model be accepted or rejected.
One common test is Kupiec's proportion-of-failures (POF) test. This test considers how the portfolio performed at many historical time intervals and counts the number of times that the losses exceeded the VaR. The null hypothesis is that the VaR is reasonable, and a sufficiently extreme test statistic means that the VaR estimate does not accurately describe the data. The test statistic is computed as:
$$ -2ln\Bigg(\frac{(1-p)^{T-x}p^x}{(1-\frac{x}{T})^{T-x}(\frac{x}{T})^x}\Bigg) $$
where:
$p$ is the quantile-of-loss of the VaR calculation (e.g., in VaR 5%, p=0.05),
$x$ (the number of failures) is the number of historical intervals over which the losses exceeded the VaR
$T$ is the total number of historical intervals considered
Or we can expand out the log for better numerical stability:
$$ \begin{equation} -2\Big((T-x)ln(1-p)+x*ln(p)-(T-x)ln(1-\frac{x}{T})-x*ln(\frac{x}{T})\Big) \end{equation} $$
If we assume the null hypothesis that the VaR is reasonable, then this test statistic is drawn from a chi-squared distribution with a single degree of freedom. By using Chi-squared distribution, we can find the p-value accompanying our test statistic value. If p-value exceeds the critical value of the Chi-squared distribution, we do have sufficient evidence to reject the null hypothesis that the model is reasonable. Or we can say, in that case, the model is considered as inaccurate.
For example, assume that we calculate VaR 5% (the confidence level of the VaR model is 95%) and get value VaR = 2.26. We also observed 50 exceptions over 500 time windows. Using the formula above, the test statistic p-value is calculated and equal to 8.08. Compared to 3.84, the critical value of Chi-squared distribution with one degree of freedom at probability 5%, the test statistic is larger. So, the model is rejected. The critical values of Chi-squared can be found by following this link.
However, in this Notebook, it's not a good idea to find the corresponding critical value by looking in a "messy" table, especially when we need to change the confidence level. Instead, from p-value, we will calculate the probability of the test statistic in Chi-square thanks to some functions in package scipy. If the calculated probability is smaller than the quantile of loss (e.g, 0.05), the model is rejected and vice versa.
def countFailures(stocksReturns, valueAtRisk):
failures = 0
stocksReturns = np.array(stocksReturns)
# iterate over time intervals
for i in range(0, len(stocksReturns[0])):
# calculate the losses in each time interval
loss = np.sum(stocksReturns[:,i])
# if the loss exceeds VaR
if loss < valueAtRisk:
failures += 1
return failures
def kupiecTestStatistic(total, failures, confidenceLevel):
if (failures == total):
return 0. #The model is not reasonable
if (failures == 0):
return 1.
failureRatio = failures / total
logNumer = (total - failures) * math.log(1 - confidenceLevel) + failures * math.log(confidenceLevel)
logDenom = (total - failures) * math.log(1 - failureRatio) + failures * math.log(failureRatio)
return -2 * (logNumer - logDenom)
# test the function
assert (round(kupiecTestStatistic(250, 36, 0.1), 2) == 4.80), "function kupiecTestStatistic runs incorrectly"
Now we can find the p-value accompanying our test statistic value.
def kupiecTestPValue(stocksReturns, valueAtRisk, confidenceLevel):
failures = countFailures(stocksReturns, valueAtRisk)
N = len(stocksReturns)
print("The number of failures:", failures)
total = len(stocksReturns[0])
print("The failures percentage: ", 100 * (failures / total), "%")
testStatistic = kupiecTestStatistic(total, failures, confidenceLevel)
#return 1 - stats.chi2.cdf(testStatistic, 1.0)
return stats.chi2.sf(testStatistic, 1.0)
varConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentVaR, 100, 0.05)
cvarConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentCVaR, 100, .05)
print("Value at Risk(VaR) 5%:", valueAtRisk)
print("Conditional Value at Risk(CVaR) 5%:", conditionalValueAtRisk)
print("VaR confidence interval: " , varConfidenceInterval)
print("CVaR confidence interval: " , cvarConfidenceInterval)
print("Kupiec test p-value: " , kupiecTestPValue(stocksReturns, valueAtRisk, 0.05))
# We added a parameter 'distribFunc' to easily change our disribution later.
# The parameter 'df' meaning 'degrees of freedom' is useful for some distributions
# The parameter 'plotDistrib' is a boolean that indicates whether we want to plot the compairaison of the factors with their approximations or not.
def invest(numberOfStocks, numTrials = 10000, distribFunc = np.random.multivariate_normal, plotDistrib = False, df=None):
# select path of all stock data files in "stock_folder"
files = [join(stock_folder, f) for f in listdir(stock_folder) if isfile(join(stock_folder, f))]
# assume that we invest only the first 'numberOfStocks' stocks (for faster computation)
files = files[:numberOfStocks]
# read each line in each file, convert it into the format: (date, value)
rawStocks = [process_stock_file(f) for f in files]
# select only instruments which have more than 5 years of history
# Note: the number of business days in a year is 260
number_of_years = 5
rawStocks = list(filter(lambda instrument: len(instrument) >= 260 * number_of_years , rawStocks))
# note that the data of crude oil and treasury is only available starting from 26/01/2006
start = datetime(year=2009, month=1, day=23)
end = datetime(year=2014, month=1, day=23)
# trim into a specific time region
# and fill up the missing values
stocks = list(map(lambda stock:
fillInHistory(
trimToRegion(stock, start, end),
start, end),
rawStocks))
# merge two factors, trim each factor into a time region
# and fill up the missing values
allfactors = factors1 + factors2
factors = list(map(lambda stock:
fillInHistory(
trimToRegion(stock, start, end),
start, end),
allfactors
))
stocksReturns = list(map(twoWeekReturns, stocks))
factorsReturns = list(map(twoWeekReturns, factors))
# transpose factorsReturns
factorMat = transpose(factorsReturns)
# featurize each row of factorMat
factorFeatures = list(map(featurize, factorMat))
# OLS require parameter is a numpy array
factor_columns = np.array(factorFeatures)
#add a constant - the intercept term for each instrument i.
factor_columns = sm.add_constant(factor_columns, prepend=True)
# estimate weights
weights = [estimateParams(stockReturns, factor_columns) for stockReturns in stocksReturns]
factorCov = np.cov(factorsReturns)
factorMeans = [sum(u) / len(u) for u in factorsReturns]
# Compare the real factor with its approximation
if (plotDistrib):
samples = (
distribFunc(factorMeans, factorCov, len(factorsReturns[0])).T if (df is None) else distribFunc(factorMeans, factorCov, len(factorsReturns[0]), df=df).T
)
compare(samples[0], factorsReturns[0], "Two-week Crude oil returns distribution")
compare(samples[1], factorsReturns[1], "Two-week Treasury bonds returns distribution")
compare(samples[2], factorsReturns[2], "Two-week GSPC returns distribution")
compare(samples[3], factorsReturns[3], "Two-week IXIC returns distribution")
parallelism = 12
trial_indexes = list(range(0, parallelism))
seedRDD = sc.parallelize(trial_indexes, parallelism)
bFactorWeights = sc.broadcast(weights)
trials = seedRDD.flatMap(lambda idx: \
simulateTrialReturns(
max(int(numTrials/parallelism), 1),
factorMeans, factorCov,
bFactorWeights.value,
distribFunc,
df
))
trials.cache()
valueAtRisk = fivePercentVaR(trials)
conditionalValueAtRisk = fivePercentCVaR(trials)
varConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentVaR, 100, 0.05)
cvarConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentCVaR, 100, .05)
kupiecTestPValueReturn = kupiecTestPValue(stocksReturns, valueAtRisk, 0.05)
print("Value at Risk(VaR) 5%:", valueAtRisk)
print("Conditional Value at Risk(CVaR) 5%:", conditionalValueAtRisk)
print("VaR confidence interval: " , varConfidenceInterval)
print("CVaR confidence interval: " , cvarConfidenceInterval)
print("Kupiec test p-value: " , kupiecTestPValueReturn)
return kupiecTestPValueReturn
pValue100 = invest(100)
pValue2000 = invest(2000)
def multivariate_t_rvs(m, S, n=1, df=np.inf):
'''generate random variables of multivariate t distribution
Code soure: https://stackoverflow.com/questions/41957633/sample-from-a-multivariate-t-distribution-python
Parameters
----------
m : array_like
mean of random variable, length determines dimension of random variable
S : array_like
square array of covariance matrix
df : int or float
degrees of freedom
n : int
number of observations, return random array will be (n, len(m))
Returns
-------
rvs : ndarray, (n, len(m))
each row is an independent draw of a multivariate t distributed
random variable
'''
m = np.asarray(m)
d = len(m)
if df == np.inf:
x = 1.
else:
x = np.random.chisquare(df, n)/df
z = np.random.multivariate_normal(np.zeros(d),S,(n,))
return m + z/np.sqrt(x)[:,None] # same output format as random.multivariate_normal
pValueTRVS35 = invest(35, distribFunc = multivariate_t_rvs, plotDistrib = True, df = 4)
def multivariate_lognormal(mean, cov, n=1):
mean_normal = np.zeros(len(mean))
cov_normal = np.zeros((len(cov), len(cov[0])))
for i in range(cov_normal.shape[0]):
for j in range(cov_normal.shape[1]):
if (mean[i] * mean[j] != 0):
cov_normal[i][j] = math.log(1 + cov[i][j] / (mean[i] * mean[j]))
for i in range(mean_normal.shape[0]):
if (mean[i] > 0):
mean_normal[i] = math.log(mean[i]) - cov_normal[i][i] / 2
return np.random.multivariate_normal(mean_normal, cov_normal, n)
pValueLogNormal35 = invest(35, distribFunc = multivariate_lognormal, plotDistrib = True)
def simulateTrialReturnsGMM(numTrials, factorsReturns, weights, n_components=1, covariance_type='tied', tol=0.001):
trialReturns = []
for i in range(0, numTrials):
# generate sample of factors' returns
models = GaussianMixture(n_components=n_components, covariance_type=covariance_type, tol=tol).fit(np.array(factorsReturns).T)
trialFactorReturns = models.sample(1)[0].reshape(len(factorsReturns))
# featurize the factors' returns
trialFeatures = featurize(np.matrix.tolist(trialFactorReturns))
# insert weight for intercept term
trialFeatures.insert(0,1)
trialTotalReturn = 0
# calculate the return of each instrument
# then calulate the total of return for this trial features
trialTotalReturn = np.sum(np.dot(weights, trialFeatures))
trialReturns.append(trialTotalReturn)
return trialReturns
def investGMM(numberOfStocks, numTrials = 10000, n_components=1, covariance_type='tied', tol=0.001):
# select path of all stock data files in "stock_folder"
files = [join(stock_folder, f) for f in listdir(stock_folder) if isfile(join(stock_folder, f))]
# assume that we invest only the first 'numberOfStocks' stocks (for faster computation)
files = files[:numberOfStocks]
# read each line in each file, convert it into the format: (date, value)
rawStocks = [process_stock_file(f) for f in files]
# select only instruments which have more than 5 years of history
# Note: the number of business days in a year is 260
number_of_years = 5
rawStocks = list(filter(lambda instrument: len(instrument) >= 260 * number_of_years , rawStocks))
# note that the data of crude oil and treasury is only available starting from 26/01/2006
start = datetime(year=2009, month=1, day=23)
end = datetime(year=2014, month=1, day=23)
# trim into a specific time region
# and fill up the missing values
stocks = list(map(lambda stock:
fillInHistory(
trimToRegion(stock, start, end),
start, end),
rawStocks))
# merge two factors, trim each factor into a time region
# and fill up the missing values
allfactors = factors1 + factors2
factors = list(map(lambda stock:
fillInHistory(
trimToRegion(stock, start, end),
start, end),
allfactors
))
stocksReturns = list(map(twoWeekReturns, stocks))
factorsReturns = list(map(twoWeekReturns, factors))
# transpose factorsReturns
factorMat = transpose(factorsReturns)
# featurize each row of factorMat
factorFeatures = list(map(featurize, factorMat))
# OLS require parameter is a numpy array
factor_columns = np.array(factorFeatures)
#add a constant - the intercept term for each instrument i.
factor_columns = sm.add_constant(factor_columns, prepend=True)
# estimate weights
weights = [estimateParams(stockReturns, factor_columns) for stockReturns in stocksReturns]
models = GaussianMixture(n_components=n_components, covariance_type=covariance_type, tol=tol).fit(np.array(factorsReturns).T)
samples = models.sample(len(factorsReturns[0]))[0].T
compare(samples[0], factorsReturns[0], "Two-week Crude oil returns distribution")
compare(samples[1], factorsReturns[1], "Two-week Treasury bonds returns distribution")
compare(samples[2], factorsReturns[2], "Two-week GSPC returns distribution")
compare(samples[3], factorsReturns[3], "Two-week IXIC returns distribution")
parallelism = 12
trial_indexes = list(range(0, parallelism))
seedRDD = sc.parallelize(trial_indexes, parallelism)
bFactorWeights = sc.broadcast(weights)
bfactorsReturns = sc.broadcast(factorsReturns)
trials = seedRDD.flatMap(lambda idx: \
simulateTrialReturnsGMM(
max(int(numTrials/parallelism), 1),
bfactorsReturns.value,
bFactorWeights.value,
n_components,
covariance_type,
tol
))
trials.cache()
valueAtRisk = fivePercentVaR(trials)
conditionalValueAtRisk = fivePercentCVaR(trials)
varConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentVaR, 100, 0.05)
cvarConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentCVaR, 100, .05)
kupiecTestPValueReturn = kupiecTestPValue(stocksReturns, valueAtRisk, 0.05)
print("Value at Risk(VaR) 5%:", valueAtRisk)
print("Conditional Value at Risk(CVaR) 5%:", conditionalValueAtRisk)
print("VaR confidence interval: " , varConfidenceInterval)
print("CVaR confidence interval: " , cvarConfidenceInterval)
print("Kupiec test p-value: " , kupiecTestPValueReturn)
return kupiecTestPValueReturn
pValueGMM35 = investGMM(35, n_components=2, covariance_type='full', tol=1e-3)
In this lecture, we studied the Monte Carlo Simulation method and its application to estimate financial risk. To apply it, first, we needed to define the relationship between market factors and the instruments' returns. In such step, you must define the model which maps the market factors' values to the instruments' values: in our use case, we used a linear regression function for building our model. Next, we also had to find the parameters of our model, which are the weights of the factors we considered. Then, we had to study the distribution of each market factor. A good way to do that is using Kernel density estimation to smooth the distribution and plot it. Depending on the shape of each figure, we had to guess the best fit distribution for each factor: in our use case, we used a very simple approach, and decided that our smoothed distributions all looked normal distributions.
Then, the idea of Monte Carlo simulation was to generate many possible values for each factor and calculate the corresponding outcomes by a well-defined model in each trial. After many trials, we were able to calculate VaR from the sequences of outcome's values. When the number of trials is large enough, the VaR converges to reasonable values, that we could validate using well-known statistical hypothesis.